Diffusion models are a class of generative models that learn the probability distribution of data by iteratively applying a series of transformations to a simple base distribution. They have been used in various applications, including image generation, text generation, and density estimation.
Existing object removal tools often rely on manual masks or text prompts, making precise removal difficult for non-expert users in complex scenes and often leading to incomplete removal or unnatural background completion. To address this issue, we present ClickRemoval, an open-source interactive object removal tool built on pretrained Stable Diffusion models and driven solely by user clicks. Without additional training, hand-drawn masks, or text descriptions, ClickRemoval localizes target objects and restores the background through self-attention modulation during denoising. Experiments show that ClickRemoval achieves competitive results across quantitative metrics and user studies. We release a complete software package at https://github.com/zld-make/ClickRemoval under the Apache-2.0 license.
In nighttime circumstances, it is challenging for individuals and machines to perceive their surroundings. While prevailing image restoration methods adeptly handle singular forms of degradation, they falter when confronted with intricate nocturnal scenes, such as the concurrent presence of weather and low-light conditions. Compounding this challenge, the lack of paired data that encapsulates the coexistence of low-light situations and other forms of degradation hinders the development of a comprehensive end-to-end solution. In this work, we contribute complex nighttime scene datasets that simulate both illumination degradation and other forms of deterioration. To address the complexity of night degradation, we propose an integration of an illumination-guided module embedded in the diffusion model to guide the illumination restoration process. Our model can preserve texture fidelity while contending with the adversities posed by various degradation in low-light scenarios.
Large Language Diffusion Models (LLDMs) are emerging as an alternative to autoregressive models, offering faster inference through higher parallelism. Similar to autoregressive LLMs, they remain prone to hallucinations, making reliable uncertainty quantification (UQ) crucial for safe deployment. However, existing UQ methods are fundamentally misaligned with this new paradigm: they assume autoregressive factorization or use expensive repeated sampling, negating the efficiency of LLDMs. In this work, we present the first systematic study of UQ for LLDMs and propose lightweight, zero-shot uncertainty signals derived from the iterative denoising process, leveraging intermediate generations, token remasking dynamics, and denoising complexity. We further adapt a state-of-the-art UQ method to LLDMs by combining masked diffusion likelihoods with trajectory-based semantic dissimilarity. We prove that expected trajectory dissimilarity lower bounds the masked diffusion training objective, which motivates its usage as an uncertainty score. Comprehensive experiments across three tasks, eight datasets, and two models show that our method achieves a great cost-performance trade-off: it approaches the strongest sampling-based baselines while incurring up to 100x lower computational overhead. Our work demonstrates that LLDMs can deliver both fast inference and reliable hallucination detection simultaneously.
Diffusion models (DMs) have exhibited remarkable efficacy in various image restoration tasks. However, existing approaches typically operate within the high-dimensional pixel space, resulting in high computational overhead. While methods based on latent DMs seek to alleviate this issue by utilizing the compressed latent space of a variational autoencoder, they require repeated encoder-decoder inference. This introduces significant additional computational burdens, often resulting in runtime performance that is even inferior to that of their pixel-space counterparts. To mitigate the computational inefficiency, this work proposes projecting data into lower-dimensional subspaces using dynamic resolution DMs to accelerate the inference process. We first fine-tune pre-trained DMs for dynamic resolution priors and adapt DPS and DAPS, which are two widely used pixel-space methods for general image restoration tasks, into the proposed framework, yielding methods we refer to as SubDPS and SubDAPS, respectively. Given the favorable inference speed and reconstruction fidelity of SubDAPS, we introduce an enhanced variant termed SubDAPS++ to further boost both reconstruction efficiency and quality. Empirical evaluations across diverse image datasets and various restoration tasks demonstrate that the proposed methods outperform recent DM-based approaches in the majority of experimental scenarios. The code is available at https://github.com/StarNextDay/SubDAPS.git.
Reinforcement learning has emerged as a powerful tool for improving diffusion-based text-to-image models, but existing methods are largely limited to single-task optimization. Extending RL to multiple tasks is challenging: joint optimization suffers from cross-task interference and imbalance, while cascade RL is cumbersome and prone to catastrophic forgetting. We propose DiffusionOPD, a new multi-task training paradigm for diffusion models based on Online Policy Distillation (OPD). DiffusionOPD first trains task-specific teachers independently, then distills their capabilities into a unified student along the student own rollout trajectories. This decouples single-task exploration from multi-task integration and avoids the optimization burden of solving all tasks jointly from scratch. Theoretically, we lift the OPD framework from discrete tokens to continuous-state Markov processes, deriving a closed-form per-step KL objective that unifies both stochastic SDE and deterministic ODE refinement via mean-matching. We formally and empirically demonstrate that this analytic gradient provides lower variance and better generality compared to conventional PPO-style policy gradients. Extensive experiments show that DiffusionOPD consistently surpasses both multi-reward RL and cascade RL baselines in training efficiency and final performance, while achieving state-of-the-art results on all evaluated benchmarks.
Discrete diffusion language models improve generation efficiency through parallel token prediction, but standard $X_0$ prediction methods introduce factorization errors by approximating the clean token posterior with independent token-wise distributions. This paper proposes Factorization-Error-Free Discrete Diffusion Language Modeling (FeF-DLLM), which replaces independent clean-token prediction with an exact prefix-conditioned factorization of the clean posterior to better preserve token dependencies. To reduce the sequential cost introduced by prefix conditioning, FeF-DLLM further incorporates speculative decoding within diffusion denoising, accelerating inference while maintaining the parallel prediction and re-masking properties of DLLMs. Theoretically, we prove that FeF-DLLM generates from the true joint distribution and derive its expected acceleration ratio. Experiments on GSM8K, MATH, HumanEval, and MBPP demonstrate that our method improves accuracy by an average of 5.04 percentage points while achieving an average inference speedup of $3.86\times$.
Diffusion models generate samples by denoising along the score of a perturbed target distribution. In practice, one trains a neural diffusion model, which is computationally expensive. Recent work suggests that score matching implicitly smooths the empirical score, and that this smoothing bias promotes generalization by capturing low-dimensional data geometry. We propose moment-matched score-smoothed overdamped Langevin dynamics (MM-SOLD), a training-free interacting particle sampler that enforces the target moments throughout the sampling trajectory. We prove that, in the large-particle limit, the empirical particle density converges to a deterministic limit whose one-particle stationary marginal is a Gibbs--Boltzmann density obtained by exponentially tilting a naive score-smoothed diffusion target. The mean and covariance of this distribution agree with the empirical moments of the training data. Experiments on 2D distributions and latent-space image generation show that MM-SOLD enables fast, robust, training-free sampling on CPUs, with sample fidelity and diversity competitive with neural diffusion baselines.
Continuous diffusion language models lag behind autoregressive transformers, partly because diffusion is applied in spaces poorly suited to language denoising and token recovery. We propose DiHAL, a geometry-guided diffusion-transformer hybrid that asks where diffusion should enter a pretrained transformer. DiHAL scores layers with geometry-based proxies, selects a diffusion-friendly hidden-state interface, and replaces the lower transformer prefix with a diffusion bridge while retaining the upper layers and original LM head. By reconstructing the selected-layer hidden state rather than tokens, DiHAL avoids direct continuous-to-discrete recovery. Experiments on 8B-scale backbones show that the geometry score predicts effective shallow insertion layers under a fixed bridge-training protocol and that hidden-state recovery improves over continuous diffusion baselines in a diagnostic comparison matching the diffusion/recovery training budget. These results suggest that hidden-state geometry helps identify where diffusion-based replacement is feasible inside pretrained language models.
Precipitation nowcasting is a vital spatio-temporal prediction task for meteorological applications but faces challenges due to the chaotic property of precipitation systems. Existing methods predominantly rely on single-source radar data to build either deterministic or probabilistic models for extrapolation. However, the single deterministic model suffers from blurring due to MSE convergence. The single probabilistic model, typically represented by diffusion models, can generate fine details but suffers from spurious artifacts that compromise accuracy and computational inefficiency. To address these challenges, this paper proposes a novel coarse-to-fine Vision Mamba Unet and residual Diffusion (VMU-Diff) based precipitation nowcasting framework. It realizes precipitation nowcasting through a two-stage process, i.e., a deterministic model-based coarse stage to predict global motion trends and a probabilistic model-based fine stage to generate fine prediction details. In the coarse prediction stage, rather than single-source radar data, both radar and multi-band satellite data are taken as input. A spatial-temporal attention block and several Vision mamba state-space blocks realize multi-source data fusion, and predict the future echo global dynamics. The fine-grained stage is realized by a spatio-temporal refine generator based on residual conditional diffusion models. It first obtains spatio-temporal residual features based on coarse prediction and ground truth, and further reconstructs the residual via conditional Mamba state-space module. Experiments on Jiangsu SWAN datasets demonstrate the improvements of our method over state-of-the-art methods, particularly in short-term forecasts.
Large diffusion vision-language models (LDVLMs) have recently emerged as a promising alternative to autoregressive models, enabling parallel decoding for efficient inference and leveraging bidirectional attention for global context. Despite these advances, their behavior under long-form generation remains underexplored. In this work, we show that existing LDVLMs suffer from repetitive generation and degraded visual grounding, and identify two underlying causes. First, repetitive generation originates from a mask token prior: since generation tokens are initialized as mask tokens, their hidden representations progressively drift toward a shared prior direction over generation steps. Second, a fundamental misalignment between the positional attention bias and the iterative unmasking process suppresses attention toward informative visual tokens, degrading visual grounding. Based on these insights, we propose a training-free approach, introducing Mask Prior Suppression and Monotonic RoPE Scaling to mitigate mask prior drift and positional attention collapse during decoding. Experiments on general multimodal benchmarks and visual grounding tasks demonstrate improvements over baseline LDVLMs, with robust gains on long-form description benchmarks. Our results show that these failures can be effectively addressed with a lightweight, plug-and-play strategy that requires no additional training and generalizes across diverse LDVLM architectures.